The objective of this regression analysis is to explore how gender composition, occupation types, and state-level differences influence job salary levels in the U.S. labor market. By applying linear regression and random forest models, we aim to identify key features that contribute to wage disparities and assess the predictive power of gender ratio as a factor.
2 ✍️ Model Inputs and Methodology
We built two regression models—Multiple Linear Regression and Random Forest—to predict salary using three inputs:
Female Ratio: the share of female workers in each occupation (women / total, from the 2023 sheet of the gender workbook), joined to each job posting by occupation
State: one‑hot encoded dummy variables for each state
Occupation: one‑hot encoded dummy variables for each broad occupational group
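To make the encoding concrete, here is a minimal sketch on a made-up three-row frame (the toy values are illustrative and not taken from the dataset). With `drop_first=True`, as used in this analysis, pandas drops the first category in sorted order, which becomes the implicit reference level for the remaining dummies.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the analysis.
toy = pd.DataFrame({
    "female_ratio": [0.30, 0.62, 0.48],
    "STATE_NAME": ["California", "Texas", "Oregon"],
})

# drop_first=True removes the first sorted category ("California" here),
# so the remaining columns are read relative to that reference state.
encoded = pd.get_dummies(toy, columns=["STATE_NAME"], drop_first=True)
print(list(encoded.columns))
# ['female_ratio', 'STATE_NAME_Oregon', 'STATE_NAME_Texas']
```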
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# 1. Load the gender workbook and the job postings
xlsx_path = "~/ad688-employability-sp25A1-group8-1/data/gender.xlsx"
job_posting_path = "~/ad688-employability-sp25A1-group8-1/job_postings.csv"

df_gender = pd.read_excel(xlsx_path, sheet_name="2023", engine="openpyxl")
df_gender["female_ratio"] = df_gender["women"] / df_gender["total"]

df_jobs = pd.read_csv(job_posting_path)
df_jobs["NAICS2"] = pd.to_numeric(df_jobs["NAICS2"], errors="coerce")

# 2. NAICS2 → Occupation
naics_to_occupation = {
    11: "Farming, fishing, and forestry occupations",
    21: "Natural resources, construction, and maintenance occupations",
    22: "Production, transportation, and material moving occupations",
    23: "Construction and extraction occupations",
    31: "Production, transportation, and material moving occupations",
    42: "Sales and office occupations",
    44: "Sales and office occupations",
    48: "Production, transportation, and material moving occupations",
    51: "Computer and mathematical occupations",
    52: "Business and financial operations occupations",
    53: "Sales and office occupations",
    54: "Professional and related occupations",
    55: "Management occupations",
    56: "Office and administrative support occupations",
    61: "Education, training, and library occupations",
    62: "Healthcare practitioners and technical occupations",
    71: "Arts, design, entertainment, sports, and media occupations",
    72: "Food preparation and serving related occupations",
    81: "Personal care and service occupations",
    92: "Public Administration",
    99: "Unclassified",
}
df_jobs["Occupation"] = df_jobs["NAICS2"].map(naics_to_occupation)

# 3. Attach the occupation-level female ratio to each posting
df_merged = df_jobs.merge(
    df_gender[["occupation", "female_ratio"]],
    left_on="Occupation",
    right_on="occupation",
    how="left",
)
df_merged["gender_category"] = df_merged["female_ratio"].apply(
    lambda x: "Female-dominated" if x >= 0.55
    else ("Male-dominated" if x <= 0.45 else "Mixed")
)

# 4. Build the design matrix; drop_first=True leaves one reference level per group
df_reg = df_merged.dropna(subset=["SALARY", "female_ratio", "STATE_NAME", "Occupation"])
X = df_reg[["female_ratio", "STATE_NAME", "Occupation"]]
X = pd.get_dummies(X, columns=["STATE_NAME", "Occupation"], drop_first=True)
y = df_reg["SALARY"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=688
)

# 5. Fit both models and score them on the held-out 30%
models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=688),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
        "R2": r2_score(y_test, y_pred),
    }

for name, mets in results.items():
    print(f"{name:17s} → RMSE: {mets['RMSE']:.2f}, R²: {mets['R2']:.3f}")

# 6. Correlation between salary and female ratio
corr = df_reg[["SALARY", "female_ratio"]].corr().iloc[0, 1]
print(f"\nCorrelation(SALARY, female_ratio): {corr:.3f}")

# 7. Top 10 Random Forest feature importances
rf = models["RandomForest"]
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances = importances.sort_values(ascending=False).head(10)
print("\nTop 10 feature importances from RandomForest:")
print(importances)
LinearRegression → RMSE: 42525.67, R²: 0.121
RandomForest → RMSE: 42364.81, R²: 0.127
Correlation(SALARY, female_ratio): -0.185
Top 10 feature importances from RandomForest:
female_ratio 0.474548
Occupation_Education, training, and library occupations 0.047786
STATE_NAME_California 0.042036
Occupation_Business and financial operations occupations 0.030760
STATE_NAME_New York 0.024201
STATE_NAME_Washington 0.019251
STATE_NAME_Texas 0.018477
Occupation_Computer and mathematical occupations 0.014710
STATE_NAME_Virginia 0.013087
STATE_NAME_Oregon 0.012776
dtype: float64
The Pearson correlation between salary and female_ratio is -0.185, indicating a modest negative relationship: postings in occupations with higher female shares tend to pay slightly less on average.
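For readers who want to see what the Pearson coefficient measures, the sketch below computes it by hand on five made-up rows and compares the result with pandas. The numbers are hypothetical and chosen only so that the relationship is negative; they are not drawn from the job-posting data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the report's dataset.
toy = pd.DataFrame({
    "female_ratio": [0.20, 0.35, 0.50, 0.65, 0.80],
    "SALARY": [95_000, 80_000, 88_000, 70_000, 72_000],
})

# Pearson r = sum of products of deviations / sqrt(product of sums of squares)
x = toy["female_ratio"] - toy["female_ratio"].mean()
y = toy["SALARY"] - toy["SALARY"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# pandas computes the same quantity (Pearson is the default method)
r_pandas = toy["SALARY"].corr(toy["female_ratio"])

print(f"manual: {r_manual:.3f}, pandas: {r_pandas:.3f}")
```

A coefficient near -0.185, as reported above, is much closer to zero than the value this toy example produces, which is why the relationship is described as modest.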
3 🔍 Implications for Job Seekers
Gender Composition & Pay Gap
Higher female representation correlates with lower average pay (r = -0.185), a pattern consistent with occupational gender segregation and compensation gaps.
Female job seekers might consider targeting occupations or regions with more balanced—or male‑dominated—workforces to maximize compensation potential.
Geographic Differences
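California, New York, and Washington are the highest-ranked state indicators in the Random Forest importances. Importance scores are unsigned, and each state dummy is measured against the omitted reference state, not a national average, so the direction of the effect should be confirmed from the linear-regression coefficients. If those are positive and relocation is an option, applying in these states may yield higher offers.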
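One way to check direction is to read the signed coefficients of a linear model fitted on state dummies. The following is a minimal sketch on synthetic data with three made-up states ("A", "B", "C") and known salary offsets; none of these values come from the report. Each coefficient estimates the pay difference relative to the dropped reference state.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical states with known offsets from a 60,000 baseline.
offsets = {"A": 0.0, "B": 10_000.0, "C": -5_000.0}
states = rng.choice(list(offsets), size=300)
salary = (
    60_000
    + np.array([offsets[s] for s in states])
    + rng.normal(0, 500, size=300)  # small noise
)

# drop_first=True drops state "A", which becomes the reference level.
X = pd.get_dummies(
    pd.DataFrame({"STATE_NAME": states}),
    columns=["STATE_NAME"],
    drop_first=True,
    dtype=float,
)
model = LinearRegression().fit(X, salary)

# Signed coefficients: positive means higher pay than reference state "A".
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs.round(0))
```

Applied to the fitted LinearRegression in this analysis, the same pattern (pairing `coef_` with the training columns) would show whether a given state sits above or below the omitted state.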
Occupational Targets
Occupations such as education/training and financial operations rank highly in feature importance, suggesting they are particularly predictive of salary.
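Computer and mathematical occupations also appear among the top ten features. Importance alone does not show whether a category raises or lowers predicted salary, so candidates with skills in these areas may command higher salaries only if the corresponding regression coefficients are positive.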
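Because each state and occupation is split across many dummy columns, it can help to sum importances back to the original variable. The sketch below uses a hypothetical helper, `group_importances`, and applies it to the ten importances printed earlier. Only the top ten are available here, so the state and occupation totals are partial sums and understate each group's full share.

```python
import pandas as pd

def group_importances(importances, categorical_cols):
    """Sum one-hot importances back to their source column (hypothetical helper)."""
    def source_column(feature):
        for col in categorical_cols:
            if feature.startswith(col + "_"):
                return col
        return feature  # numeric features keep their own name
    return importances.groupby(source_column).sum().sort_values(ascending=False)

# Top 10 importances as printed in the report (not the full feature set).
top10 = pd.Series({
    "female_ratio": 0.474548,
    "Occupation_Education, training, and library occupations": 0.047786,
    "STATE_NAME_California": 0.042036,
    "Occupation_Business and financial operations occupations": 0.030760,
    "STATE_NAME_New York": 0.024201,
    "STATE_NAME_Washington": 0.019251,
    "STATE_NAME_Texas": 0.018477,
    "Occupation_Computer and mathematical occupations": 0.014710,
    "STATE_NAME_Virginia": 0.013087,
    "STATE_NAME_Oregon": 0.012776,
})

grouped = group_importances(top10, ["STATE_NAME", "Occupation"])
print(grouped)
```

Running the helper on the full `feature_importances_` series, not just the top ten, would give the complete split between female ratio, state, and occupation.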